Adaptive Importance Sampling with Automatic Model Selection in Value Function Approximation
Authors
Abstract
Off-policy reinforcement learning is aimed at efficiently reusing data samples gathered in the past, which is an essential problem for physically grounded AI since experiments are usually prohibitively expensive. A common approach is to use importance sampling techniques to compensate for the bias caused by the difference between the data-sampling policies and the target policy. However, existing off-policy methods often do not take the variance of value function estimators explicitly into account, and therefore their performance tends to be unstable. To cope with this problem, we propose using an adaptive importance sampling technique which allows us to actively control the trade-off between bias and variance. We further provide a method for optimally determining the trade-off parameter based on a variant of cross-validation. We demonstrate the usefulness of the proposed approach through simulations.

Introduction

Policy iteration is a reinforcement learning setup where the optimal policy is obtained by iteratively performing policy evaluation and improvement steps (Sutton & Barto 1998). When policies are updated, many popular policy iteration methods require the user to gather new data samples following the updated policy, and the new samples are used for value function approximation. However, this approach is inefficient, particularly when the sampling cost is high, since previously gathered data samples are simply discarded; it would be more efficient if we could reuse the data collected in the past. A situation where the sampling policy (the policy used for gathering data samples) and the current policy are different is called off-policy reinforcement learning (Sutton & Barto 1998); only a few notable methods, such as Q-learning (Watkins 1989) and policy search by dynamic programming (Bagnell et al. 2003), can directly handle this situation. In the off-policy situation, simply employing a standard policy iteration method such as least-squares policy iteration (Lagoudakis & Parr 2003) does not lead to the optimal policy, as the sample distribution is determined by the policies; the sampling policy can therefore introduce bias into the sample distribution. This distribution mismatch problem can be eased by the use of importance sampling techniques (Fishman 1996), which cancel the bias asymptotically. However, the approximation error is not necessarily small when the bias is reduced by importance sampling; the variance of estimators should also be taken into account, since the approximation error is the sum of the squared bias and the variance. Due to large variance, existing importance sampling techniques tend to be unstable (Sutton & Barto 1998; Precup, Sutton, & Singh 2000).

To overcome this instability problem, we propose using an adaptive importance sampling technique from statistics (Shimodaira 2000). The proposed adaptive method, which smoothly bridges the ordinary estimator and the importance-weighted estimator, allows us to control the trade-off between bias and variance. Thus, provided that the trade-off parameter is determined carefully, optimal performance can be achieved in terms of both bias and variance. However, the optimal value of the trade-off parameter depends heavily on the data samples and the policies, so using a prefixed parameter value may not always be effective in practice.
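As a concrete illustration (not code from the paper), the adaptive importance weighting of Shimodaira (2000) can be folded into least-squares value function fitting by flattening the importance weights with an exponent ν ∈ [0, 1]: ν = 0 recovers the ordinary (low-variance, biased) estimator and ν = 1 the fully importance-weighted (asymptotically unbiased, high-variance) one. The sketch below assumes a linear model Q̂(s, a) = θᵀφ(s, a), precomputed per-sample importance weights, and generic regression targets; the function name and the small ridge term are our own illustrative choices.

```python
import numpy as np

def adaptive_iw_least_squares(Phi, targets, weights, nu):
    """Fit theta by minimizing sum_i w_i^nu * (theta^T phi_i - target_i)^2.

    Phi     : (n, b) design matrix of basis values phi(s_i, a_i)
    targets : (n,)   regression targets for the value function
    weights : (n,)   importance weights (target policy / sampling policy)
    nu      : flattening parameter in [0, 1]; nu = 0 gives the ordinary
              estimator, nu = 1 the fully importance-weighted estimator.
    """
    W = weights ** nu                              # flattened importance weights
    # Weighted normal equations: (Phi^T diag(W) Phi) theta = Phi^T diag(W) y
    A = Phi.T @ (W[:, None] * Phi)
    b = Phi.T @ (W * targets)
    # Tiny ridge term only to keep the sketch numerically safe.
    return np.linalg.solve(A + 1e-8 * np.eye(Phi.shape[1]), b)

# Usage: sweep nu between the two extremes of the bias-variance trade-off.
# theta_ordinary = adaptive_iw_least_squares(Phi, y, w, nu=0.0)
# theta_full_iw  = adaptive_iw_least_squares(Phi, y, w, nu=1.0)
```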
In order to optimally choose the value of the trade-off parameter, we reformulate the value function approximation problem as a supervised regression problem and propose using an automatic model selection method based on a variant of cross-validation (Sugiyama, Krauledat, & Müller 2007). The method, called importance-weighted cross-validation, enables us to estimate the approximation error of value functions in an almost unbiased manner even under off-policy situations. Thus we can actively determine the trade-off parameter based on the data samples at hand. We demonstrate the usefulness of the proposed approach through simulations.

Background and Notation

In this section, we formulate the reinforcement learning problem.

Markov Decision Processes. Let us consider a Markov decision process (MDP) (S, A, P_T, R, γ), where S is a set of states, A is a set of actions, P_T(s′|s, a) (∈ [0, 1]) is the transition probability density from state s to next state s′ when action a is taken, R(s, a, s′) (∈ ℝ) is the reward for the transition from s to s′ by taking action a, and γ (∈ [0, 1)) is the discount factor for future rewards. Let π(a|s) (∈ [0, 1]) be a stochastic policy, i.e., the conditional probability density of taking action a given state s. Let Q(s, a) (∈ ℝ) be the state-action value function for policy π, i.e., the expected discounted sum of rewards the agent will obtain when taking action a in state s and following policy π thereafter:

Q(s, a) ≡ E_{π, P_T} [ Σ_{n=1}^{∞} γ^{n−1} R(s_n, a_n, s_{n+1}) | s_1 = s, a_1 = a ].
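To make the model-selection step concrete, the following sketch (again our illustration, not the paper's code) shows how importance-weighted cross-validation (Sugiyama, Krauledat, & Müller 2007) can score candidate values of the trade-off parameter ν: the held-out squared error is weighted by the full importance weights, so it approximates the error under the target policy's distribution even though the data were collected under the sampling policy. It assumes the hypothetical adaptive_iw_least_squares helper sketched earlier is in scope.

```python
import numpy as np

def iwcv_select_nu(Phi, targets, weights, nu_grid, n_folds=5, seed=0):
    """Pick the flattening parameter nu by importance-weighted cross-validation.

    The validation error on each held-out fold is weighted by the full
    importance weights, giving an (almost) unbiased estimate of the
    approximation error under the target policy's sample distribution.
    """
    n = Phi.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    scores = []
    for nu in nu_grid:
        errs = []
        for k in range(n_folds):
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            theta = adaptive_iw_least_squares(Phi[trn], targets[trn],
                                              weights[trn], nu)
            resid = Phi[val] @ theta - targets[val]
            # Importance-weighted validation error (weights NOT flattened here).
            errs.append(np.mean(weights[val] * resid ** 2))
        scores.append(np.mean(errs))
    best = nu_grid[int(np.argmin(scores))]
    return best, scores
```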
Similar Articles
Uniform Convergence of Sample Average Approximation with Adaptive Multiple Importance Sampling
We study sample average approximations under adaptive importance sampling in which the sample densities may depend on previous random samples. Based on a generic uniform law of large numbers, we establish uniform convergence of the sample average approximation to the function being approximated. In the optimization context, we obtain convergence of the optimal value and optimal solutions of the...
Uniform Convergence of Sample Average Approximation with Adaptive Importance Sampling
We study sample average approximations under adaptive importance sampling. Based on a Banach-space-valued martingale strong law of large numbers, we establish uniform convergence of the sample average approximation to the function being approximated. In the optimization context, we obtain convergence of the optimal value and optimal solutions of the sample average approximation.
Adaptive importance sampling for network growth models
Network Growth Models such as Preferential Attachment and Duplication/Divergence are popular generative models with which to study complex networks in biology, sociology, and computer science. However, analyzing them within the framework of model selection and statistical inference is often complicated and computationally difficult, particularly when comparing models that are not directly relat...
An Alternative Stability Proof for Direct Adaptive Function Approximation Techniques Based Control of Robot Manipulators
This short note points out an improvement on the robust stability analysis for electrically driven robots given in the paper. In the paper, the author presents a FAT-based direct adaptive control scheme for electrically driven robots in the presence of nonlinearities associated with actuator input constraints. However, the stability analysis he offers for the closed-loop system is not suitable. In other w...